System Debugging

by Russ Kepler

A fact of life in our modern computer age is that the more parts a system has, the greater the chance that one or more of them will fail. Deciphering the clues during debugging can pose a real challenge. The intent of this article is to give you a methodical approach to isolating a problem, step by step, until the trouble spot can be located and, with any luck, fixed or replaced.

Below are the three key steps to system debugging.

Isolate the cause of the problem using a divide-and-conquer strategy.
Know what's new or has changed within your system.
Identify the problem and be able to produce it on demand.

DIVIDE-AND-CONQUER

The first step with any failure is to pin down just when it first happened. Often one component takes the blame when another component is really at fault.

As an example, BASIS receives calls from developers whose systems began failing after an upgrade. On going through the system changes, the developer may note that the only thing done was a complete change of the motherboard, keeping only the memory from the old one. As the system starts failing, BASIS may get a call asking what's wrong with BB^x. Ultimately, many of these problems are caused by an overzealous BIOS tuner attempting to squeeze the last cycle of performance from the system, and simply restoring the default BIOS parameters effects an immediate cure.

Once a failure is noted, it's usually necessary to isolate it beyond the component level. Narrowing it down to a particular subsystem or driver is a good target.
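The divide-and-conquer idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list of changes and the works_with test are hypothetical stand-ins, the system is assumed to work with no changes applied and fail with all of them, and it is assumed to stay broken once the faulty change is in place.

```python
from typing import Callable, Sequence


def find_first_bad_change(
    changes: Sequence[str],
    works_with: Callable[[Sequence[str]], bool],
) -> str:
    """Return the first change whose addition makes the system fail.

    Assumes the system works with no changes applied, fails with all of
    them applied, and stays broken once the faulty change is present.
    """
    if not changes:
        raise ValueError("need at least one change to search")
    # Invariant: works with changes[:lo], fails with changes[:hi].
    lo, hi = 0, len(changes)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works_with(changes[:mid]):
            lo = mid  # the fault was introduced later
        else:
            hi = mid  # the fault is already present
    return changes[hi - 1]


# Hypothetical upgrade history, oldest change first.
changes = ["new motherboard", "tuned BIOS timings", "new network driver", "new disk"]


def works_with(applied: Sequence[str]) -> bool:
    # Stand-in for a real test run; here the tuned BIOS is the hidden culprit.
    return "tuned BIOS timings" not in applied


culprit = find_first_bad_change(changes, works_with)
print(culprit)
```

Each pass halves the number of suspects, so even a long list of changes needs only a handful of test runs. In practice works_with would mean putting the system into that state and running whatever exercise shows the failure.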

KNOW WHAT'S NEW OR HAS CHANGED WITHIN YOUR SYSTEM

To isolate system components effectively, it's wise to make a list of what is new or has changed. Start by putting everything back to the way it was before things started breaking down. This task can be difficult at times, but it is useful in confirming the location of the failure.
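If the state of the system was written down before and after the work, the list of changes can be produced mechanically. The sketch below assumes such an inventory exists as a simple name-to-value mapping; the component names and values are invented for illustration.

```python
def changed_components(before: dict, after: dict) -> dict:
    """Map each component that differs to its (before, after) values.

    A component missing from one inventory shows up with None on that side.
    """
    changes = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Hypothetical inventories taken before and after an upgrade.
before = {"motherboard": "rev A", "memory": "16 MB", "bios_settings": "defaults"}
after = {"motherboard": "rev B", "memory": "16 MB", "bios_settings": "tuned"}

for name, (old, new) in changed_components(before, after).items():
    print(f"{name}: {old} -> {new}")
```

Anything that appears in the output is a candidate to put back the way it was; anything that does not appear can be set aside for now.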

Networks are a typical case. Network failures can be puzzling because they can appear randomly and sometimes affect workstations that are not causing the problem. Here the best-suited diagnostic tool is wholesale replacement: swapping machines, or groups of machines, around. If machine A has a problem at site A, try swapping it with an identical machine from another site. If the problem moves with the machine, start debugging that machine. If the problem stays behind at site A, look at the network wire the machine was on, the isolated run to site A. Sometimes it's easiest to simply lay in a parallel cable to rule out the cable itself or the route it takes.
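The reasoning behind the swap test can be written out as a small decision function. This is a sketch only: the function name and the wording of the results are made up here, and the last two outcomes are suggestions rather than anything the swap test itself proves.

```python
def interpret_swap(
    suspect_fails_at_new_site: bool,
    replacement_fails_at_old_site: bool,
) -> str:
    """Interpret the result of swapping a suspect machine with an identical one."""
    if suspect_fails_at_new_site and not replacement_fails_at_old_site:
        return "problem moved with the machine: debug the machine"
    if replacement_fails_at_old_site and not suspect_fails_at_new_site:
        return "problem stayed at the site: check the cable run"
    if suspect_fails_at_new_site and replacement_fails_at_old_site:
        return "both fail: look for a second fault or a shared cause"
    return "neither fails: try to reproduce the problem before going further"


# Machine A now runs cleanly elsewhere, but its replacement fails at site A.
print(interpret_swap(suspect_fails_at_new_site=False,
                     replacement_fails_at_old_site=True))
```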

IDENTIFY THE PROBLEM AND PRODUCE IT ON DEMAND

Cable errors can be particularly vexing, as a marginal cable or a bad connection will cause an error to come and go. It is critical that you check any assumptions you make from the available data. If you decide that a particular machine exhibits a problem, check the machine to make sure you can produce the problem on demand. If you can't, then you need to sit back and think about the problem. Perhaps there is another cause.
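One way to check whether a problem can be produced on demand is to run the same exercise many times and count the failures. The sketch below assumes the exercise can be wrapped in a function that returns True on success; flaky_transfer is a deterministic stand-in invented for the example, not a real test.

```python
import itertools
from typing import Callable


def failure_rate(attempt: Callable[[], bool], runs: int = 20) -> float:
    """Run attempt() repeatedly and return the fraction of runs that failed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = sum(1 for _ in range(runs) if not attempt())
    return failures / runs


# Stand-in for a real exercise, such as sending a block of data over the
# suspect cable and comparing it on arrival. Here every third try fails.
_attempts = itertools.count(1)


def flaky_transfer() -> bool:
    return next(_attempts) % 3 != 0


rate = failure_rate(flaky_transfer, runs=30)
print(f"failed {rate:.0%} of runs")
```

A rate near zero on the suspect machine suggests the assumption about where the problem lives needs another look. A steady, repeatable rate gives a baseline to compare against after each change.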

Sometimes cable errors are simple problems. A number of years ago I was asked to check into a problem at a customer site where the customer said they saw product descriptions change when entering orders. Only when I got there did they mention that the problem existed only in the warehouse. There was indeed a problem, and I was able to help. The customer had run a serial cable about 300 feet over fluorescent lights. Once the cable was moved, everything ran smoothly. (I imagine we'd have been able to hear the AC hum on the cable had we connected a speaker.)

So, in short, to solve a particular problem:

Isolate the cause of the problem using a divide-and-conquer strategy.
Know what's new or has changed within your system.
Identify the problem and be able to produce it on demand.